Python: Foundry Evals integration for Python by alliscode · Pull Request #4750 · microsoft/agent-framework

alliscode · 2026-03-17T21:15:01Z

Add evaluation framework with local and Foundry-hosted evaluator support:

EvalItem/EvalResult core types with conversation splitting strategies
@evaluator decorator for defining custom evaluation functions
LocalEvaluator for running evaluations locally
FoundryEvals provider for Azure AI Foundry hosted evaluations
evaluate_agent() orchestration with expected values support
evaluate_workflow() for multi-agent workflow evaluation
Comprehensive test suite and evaluation samples

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the Contribution Guidelines
All unit tests pass, and I have added new tests where possible
Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

python/packages/core/agent_framework/_evaluation.py

python/packages/core/agent_framework/__init__.py

python/packages/core/agent_framework/_eval.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

markwallace-microsoft · 2026-03-19T20:44:50Z

Python Test Coverage Report

File	Stmts	Miss	Cover	Missing
packages/azure-ai/agent_framework_azure_ai
_foundry_evals.py	250	5	98%	273, 324–325, 446, 657
packages/core/agent_framework
_evaluation.py	645	77	88%	228, 243, 460, 462, 570, 573, 652–654, 659, 696–699, 755–756, 759, 765–767, 771, 804–806, 860, 895, 907–909, 914, 938–943, 1034, 1112–1113, 1115–1119, 1125, 1163, 1508, 1510, 1518, 1528, 1532, 1577, 1595–1596, 1628–1631, 1707, 1713, 1728, 1732–1734, 1764, 1770–1774, 1808, 1856–1857, 1859, 1884–1885, 1890
TOTAL	28907	3494	87%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
5660	20 💤	0 ❌	0 🔥	1m 38s ⏱️

Merged and refactored eval module per Eduard's PR review:

- Merge _eval.py + _local_eval.py into single _evaluation.py
- Convert EvalItem from dataclass to regular class
- Rename to_dict() to to_eval_data()
- Convert _AgentEvalData to TypedDict
- Simplify check system: unified async pattern with isawaitable
- Parallelize checks and evaluators with asyncio.gather
- Add all/any mode to tool_called_check
- Fix bool(passed) truthy bug in _coerce_result
- Remove deprecated function_evaluator/async_function_evaluator aliases
- Remove _MinimalAgent, tighten evaluate_agent signature
- Set self.name in __init__ (LocalEvaluator, FoundryEvals)
- Limit FoundryEvals to AsyncOpenAI only
- Type project_client as AIProjectClient
- Remove NotImplementedError continuous eval code
- Add evaluation samples in 02-agents/ and 03-workflows/
- Update all imports and tests (167 passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
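The "unified async pattern with isawaitable" and the asyncio.gather parallelization named in this commit can be pictured with a small standalone sketch. It does not reproduce the code in _evaluation.py; `Check`, `run_check`, `run_checks`, and the two sample checks are hypothetical names chosen for illustration.

```python
import asyncio
import inspect
from typing import Awaitable, Callable, Union

# Hypothetical check signature: takes the response text and returns a bool,
# either directly (sync check) or through a coroutine (async check).
Check = Callable[[str], Union[bool, Awaitable[bool]]]


async def run_check(check: Check, response: str) -> bool:
    """Call a check and await its result only when the result is awaitable."""
    result = check(response)
    if inspect.isawaitable(result):
        result = await result
    return result


async def run_checks(checks: list[Check], response: str) -> list[bool]:
    """Run all checks concurrently; results keep the order of `checks`."""
    return list(await asyncio.gather(*(run_check(c, response) for c in checks)))


def mentions_weather(response: str) -> bool:  # synchronous check
    return "weather" in response.lower()


async def is_short(response: str) -> bool:  # asynchronous check
    await asyncio.sleep(0)  # stands in for real async work
    return len(response) < 80


results = asyncio.run(run_checks([mentions_weather, is_short], "The weather is sunny."))
```

Because the caller never needs to know whether a check is sync or async, a single code path is enough, which is presumably what allowed the separate sync/async evaluator aliases to be removed.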

Use cast(list[Any], x) with type: ignore[redundant-cast] comments to satisfy both mypy (which considers casting Any redundant) and pyright strict mode (which needs explicit casts to narrow Unknown types). Also fix evaluator decorator check_name type annotation to be explicitly str, resolving mypy str|Any|None mismatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@overload

…attr

- Apply pyupgrade: Sequence from collections.abc, remove forward-ref quotes
- Add @overload signatures to evaluator() for proper @evaluator usage
- Fix evaluate_workflow sample to use WorkflowBuilder(start_executor=) API
- Fix _workflow.py executor.reset() to use getattr pattern for pyright
- Remove unused EvalResults forward-ref string in default_factory lambda

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The test_configure_otel_providers_with_env_file_and_vs_code_port test triggers gRPC OTLP exporter creation, but the grpc dependency is optional and not installed by default. Add skipif decorator matching the pattern used by all other gRPC exporter tests in the same file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Move module docstrings before imports (after copyright header)
- Add -> None return type to all main() and helper functions
- Fix line-too-long in multiturn sample conversation data
- Add Workflow import for typed return in all_patterns_sample

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…nings

- Simplify _ensure_async_result to direct await (async-only clients)
- Replace get_event_loop() with get_running_loop()
- Narrow _fetch_output_items exception handling to specific types
- Add warning log when _filter_tool_evaluators falls back to defaults
- Add DeprecationWarning to options alias in Agent.__init__
- Add DeprecationWarning to evaluate_response()
- Rename raw key to _raw_arguments in convert_message fallback
- Fix evaluate_agent_sample.py: replace evals.select() with FoundryEvals()
- Fix evaluate_multiturn_sample.py: use Message/Content/FunctionTool types
- Fix evaluate_workflow_sample.py: replace evals.select() with FoundryEvals()
- Update test mocks to use AsyncMock for awaited API calls

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add num_repetitions=2 positive test verifying 2×items and 4 agent calls
- Add _poll_eval_run tests: timeout, failed, and canceled paths
- Add evaluate_traces tests: validation error, response_ids path, trace_ids path
- Add evaluate_foundry_target happy-path test with target/query verification

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Wrap implicit string concatenation in parens in evaluate_multiturn_sample.py
- Apply ruff formatter to 6 other files with minor formatting drift

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…nch) Reverts changes to _agents.py, _agent_executor.py, and _workflow.py back to upstream/main. These fixes are now in a separate PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@evaluator

Code fixes:

- Fix _normalize_queries inverted condition (single query now replicates to match expected_count)
- Fix substring match bug: 'end' in 'backend' matched; use exact set lookup for executor ID filtering
- Fix used_available_tools sample: tool_definitions→tools param, use FunctionTool attribute access instead of dict .get()
- Add None-check in _resolve_openai_client for misconfigured project
- Add Returns section to evaluate_workflow docstring
- Cache inspect.signature in @evaluator wrapper (avoid per-item reflection)

Architecture:

- Extract _evaluate_via_responses as module-level helper; evaluate_traces now calls it directly instead of creating a FoundryEvals instance
- Move Foundry-specific typed-content conversion out of core to_eval_data; core now returns plain role/content dicts, FoundryEvals applies AgentEvalConverter in _evaluate_via_dataset

Tests:

- evaluate_response() deprecation warning emission and delegation
- num_repetitions > 1 with expected_output and expected_tool_calls
- Mock output_items.list in test_evaluate_calls_evals_api
- Update to_eval_data assertions for plain-dict format
- Unknown param error now raised at @evaluator decoration time

Skipped (separate PR): executor reset loop, xfail removal, options alias

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
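Two items in this commit, caching inspect.signature in the wrapper and raising the unknown-parameter error at decoration time, fit together naturally. The sketch below is a minimal stand-in, not the framework's @evaluator: the `SUPPORTED_PARAMS` set, the dict-shaped item, and the choice of TypeError are all assumptions made for illustration.

```python
import functools
import inspect
from typing import Any, Callable

# Hypothetical set of fields an evaluation item can supply to a check.
SUPPORTED_PARAMS = {"query", "response", "expected_output"}


def evaluator(fn: Callable[..., bool]) -> Callable[[dict[str, Any]], bool]:
    """Wrap `fn` so it receives only the item fields it declares."""
    # Reflection happens once, at decoration time, not once per item.
    param_names = tuple(inspect.signature(fn).parameters)
    unknown = set(param_names) - SUPPORTED_PARAMS
    if unknown:
        # Fail while the module is being imported, before any item is evaluated.
        raise TypeError(f"{fn.__name__}: unknown parameter(s) {sorted(unknown)}")

    @functools.wraps(fn)
    def wrapper(item: dict[str, Any]) -> bool:
        # Uses the cached parameter names; no inspect call on the hot path.
        return fn(**{name: item[name] for name in param_names})

    return wrapper


@evaluator
def exact_match(response: str, expected_output: str) -> bool:
    return response.strip() == expected_output.strip()


item = {"query": "2 + 2?", "response": "4 ", "expected_output": "4"}
passed = exact_match(item)
```

Validating at decoration time means a misspelled parameter surfaces as soon as the defining module loads rather than partway through an evaluation run.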

- Revert test_full_conversation.py to upstream/main (the session preservation test was incorrectly changed to assert clearing)
- Fix pyright reportUnnecessaryComparison on get_openai_client() None check by adding ignore comment
- Fix pyright reportPrivateUsage: add public EvalItem.split_messages() method and use it in FoundryEvals._evaluate_via_dataset instead of accessing private _split_conversation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add try/except guard for non-numeric score in _coerce_result
- Add poll_interval minimum bound (0.1s) to prevent tight loops
- Add runtime async client check in _resolve_openai_client
- Remove _ensure_async_result wrapper (10 call sites → direct await)
- Better error message when queries provided without agent
- Import-time asserts for evaluator set consistency
- Remove 28 redundant @pytest.mark.asyncio decorators
- Add doc note about _raw_arguments sensitive data
- Tests: tool_called_check mode=any, _normalize_queries branches, _extract_result_counts paths, _extract_per_evaluator, bare check via evaluate_agent, output_items assertion, modulo wrapping, async client check, queries-without-agent error

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
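The _coerce_result items mentioned in these commits (the bool(passed) truthiness bug and the guard for non-numeric scores) are easy to illustrate in isolation. The following is a sketch under assumptions, not the function from the PR: the accepted input shapes, the `threshold` parameter, and the decision to raise TypeError on uninterpretable values are all hypothetical.

```python
from typing import Any


def coerce_result(raw: Any, threshold: float = 0.5) -> bool:
    """Map a check's raw return value to pass/fail (hypothetical rules)."""
    if isinstance(raw, bool):
        return raw
    if isinstance(raw, dict):
        passed = raw.get("passed")
        if isinstance(passed, bool):
            return passed
        # Deliberately no bool(passed) here: bool("false") is True,
        # which is the kind of truthiness bug the commit describes.
        raw = raw.get("score")
    try:
        return float(raw) >= threshold
    except (TypeError, ValueError):
        # float(None) raises TypeError, float("high") raises ValueError.
        raise TypeError(f"cannot interpret check result {raw!r}") from None
```

With these rules `{"passed": "false"}` is rejected instead of silently counting as a pass, and `{"score": "high"}` produces a clear error rather than an unhandled ValueError from float().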

- Replace module-level assert with if/raise for evaluator set consistency checks (ruff S101 disallows bare assert)
- Add type: ignore[arg-type] and pyright: ignore[reportArgumentType] on OpenAI SDK evals API calls that pass dicts where typed params are expected (SDK accepts dicts at runtime)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix all_passed ignoring parent result_counts when sub_results present
- Fix _extract_tool_calls: parse string arguments via json.loads before falling back to None (real LLM responses use string arguments)
- Sanitize _raw_arguments to '[unparseable]' to avoid leaking sensitive tool-call data to external evaluation services
- Add NOTE comment on to_eval_data message serialization dropping non-text content (tool calls, results)
- Eliminate double conversation split in _evaluate_via_dataset: build JSONL dicts directly from split_messages + AgentEvalConverter
- Raise poll_interval floor from 0.1s to 1.0s to prevent rate-limit exhaustion
- Fix MagicMock(name=...) bug in test: sets display name not .name attr
- Fix mock_output_item.sample: use MagicMock object instead of dict so _fetch_output_items exercises error/usage/input/output extraction

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
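The tool-call argument handling described in this commit (try json.loads on string arguments, fall back to None, and never forward the raw string to an external service) could look roughly like the sketch below. The function names and the exact shape of the fallback dict are assumptions; only the `_raw_arguments` key and the `'[unparseable]'` placeholder come from the commit message.

```python
import json
from typing import Any, Optional


def parse_tool_arguments(arguments: Any) -> Optional[dict[str, Any]]:
    """Return tool-call arguments as a dict, or None if they cannot be parsed."""
    if isinstance(arguments, dict):
        return arguments
    if isinstance(arguments, str):
        try:
            parsed = json.loads(arguments)
        except json.JSONDecodeError:
            return None
        # A JSON array or scalar is valid JSON but not a usable argument map.
        return parsed if isinstance(parsed, dict) else None
    return None


def arguments_for_export(arguments: Any) -> dict[str, Any]:
    """Arguments that are safe to send out: the raw string is never included."""
    parsed = parse_tool_arguments(arguments)
    return parsed if parsed is not None else {"_raw_arguments": "[unparseable]"}
```

Replacing the raw text with a fixed placeholder loses some debugging detail, which matches the trade-off the commit states: avoiding leaks of tool-call data takes priority.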

Code fixes:

- Move import-time RuntimeError checks to unit tests (avoids breaking imports for all users on developer set-drift mistake)
- _filter_tool_evaluators now raises ValueError when all evaluators require tools but no items have tools (was silently substituting)
- Add poll_interval upper bound (60s) to prevent single-iteration sleep
- Log exc_info=True in _fetch_output_items for debugging API changes
- Fix evaluate() docstring: remove claim about Responses API optimization
- Validate target dict has 'type' key in evaluate_foundry_target
- Document to_eval_data() limitation: non-text content is omitted

Tests:

- TestEvaluatorSetConsistency: verify _AGENT/_TOOL subsets of _BUILTIN
- TestEvaluateTracesAgentId: agent_id-only path with lookback_hours
- TestFilterToolEvaluatorsRaises: ValueError on all-tool no-items
- TestEvaluateFoundryTargetValidation: target without 'type' key
- Assert items==[] on failed/canceled poll results
- Mock output_items.list in response_ids test for full flow
- TestAllPassedSubResults: result_counts=None + sub_results delegation and parent failures override sub_results
- TestBuildOverallItemEmpty: empty workflow outputs returns None

Skipped r5-07 (_raw_arguments length hint): marginal debugging value, could leak content size information.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
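The poll_interval bounds discussed across these commits (a 1.0s floor and a 60s ceiling) can be shown with a small polling loop. This is a sketch under stated assumptions and not _poll_eval_run itself: `get_status`, the status strings other than "failed" and "canceled", the 5.0s default interval, and the injectable `sleep` parameter are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

MIN_POLL_INTERVAL = 1.0   # floor, to avoid exhausting rate limits
MAX_POLL_INTERVAL = 60.0  # ceiling, so one sleep cannot consume the whole timeout
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def clamp_poll_interval(poll_interval: float) -> float:
    """Force the interval into [MIN_POLL_INTERVAL, MAX_POLL_INTERVAL]."""
    return max(MIN_POLL_INTERVAL, min(poll_interval, MAX_POLL_INTERVAL))


async def poll_until_done(
    get_status: Callable[[], Awaitable[str]],
    poll_interval: float = 5.0,
    timeout: float = 180.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    """Poll until a terminal status is seen or the next sleep would pass the deadline."""
    interval = clamp_poll_interval(poll_interval)
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await get_status()
        if status in TERMINAL_STATUSES:
            return status
        if loop.time() + interval > deadline:
            return "timeout"
        await sleep(interval)
```

Passing `sleep` as a parameter is only a testing convenience here: a fake can record the requested intervals without actually waiting.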

…s=...) The referenced function doesn't exist; the correct API is evaluate_traces(response_ids=...) from the azure-ai package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Remove to_eval_data() from EvalItem (dead code after r4-05 JSONL refactor)
- Migrate 15 tests from to_eval_data() to split_messages()
- Update sample to use split_messages() + Message properties
- Remove unimplemented Responses API optimization docstring claim
- Update split_messages() docstring to not reference removed method

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The method was never called — evaluate() uses _evaluate_via_dataset, and evaluate_traces() calls _evaluate_via_responses_impl directly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…format

- Remove import of non-existent _foundry_memory_provider module (incorrectly kept during rebase conflict resolution)
- Apply ruff formatter to test_local_eval.py and get-started samples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The upstream provider-leading client refactor (microsoft#4818) made client= a required parameter on Agent(). Update the three getting-started eval samples to use FoundryChatClient with FOUNDRY_PROJECT_ENDPOINT, matching the standard pattern from 01-get-started samples. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace ~80 lines of manual OpenAI evals API code (create_eval, run_eval, manual polling, raw JSONL params) with FoundryEvals:

- evaluate_groundedness() uses FoundryEvals.evaluate() with EvalItem
- Remove create_openai_client(), create_eval(), run_eval() functions
- Remove openai SDK type imports (DataSourceConfigCustom, etc.)
- run_self_reflection_batch creates FoundryEvals instance once, reuses it for all iterations across all prompts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Migrate all foundry_evals samples from AzureOpenAIResponsesClient to FoundryChatClient
- Update env var from AZURE_AI_PROJECT_ENDPOINT to FOUNDRY_PROJECT_ENDPOINT
- Use AzureCliCredential consistently across all samples
- Fix README.md: correct function names (evaluate_dataset -> FoundryEvals.evaluate, evaluate_responses -> evaluate_traces)
- Update self_reflection .env.example and README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR adds a provider-agnostic evaluation framework to the Python Agent Framework, with both local (no-API) evaluators and an Azure AI Foundry-backed provider, plus end-to-end samples that demonstrate agent and workflow evaluation patterns.

Changes:

Introduces core evaluation types and orchestration (EvalItem, EvalResults, evaluate_agent(), evaluate_workflow()) plus local checks (LocalEvaluator, @evaluator).
Adds Azure AI Foundry provider integration (FoundryEvals) and trace/target evaluation helpers.
Adds/updates evaluation samples (Foundry evals patterns + self-reflection groundedness) and expands test coverage for local evaluation.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.

File	Description
python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py	Migrates groundedness scoring to `FoundryEvals` and updates batch runner.
python/samples/05-end-to-end/evaluation/self_reflection/README.md	Updates self-reflection sample documentation for Foundry Evals usage and env vars.
python/samples/05-end-to-end/evaluation/self_reflection/.env.example	Updates env var example to `FOUNDRY_PROJECT_ENDPOINT`.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py	New sample: evaluate multi-agent workflows with Foundry evaluators.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py	New sample: evaluate existing responses / traces via Foundry.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_multiturn_sample.py	New sample: demonstrate conversation split strategies for eval.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py	New sample: mix `LocalEvaluator` with Foundry evaluators in one call.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_all_patterns_sample.py	New “kitchen sink” sample covering all evaluation patterns.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py	New sample: evaluate_agent patterns + direct `FoundryEvals.evaluate()`.
python/samples/05-end-to-end/evaluation/foundry_evals/README.md	New README describing Foundry eval samples and entry points.
python/samples/05-end-to-end/evaluation/foundry_evals/.env.example	New env example for Foundry eval samples.
python/samples/03-workflows/evaluation/evaluate_workflow.py	New workflow evaluation sample using local checks.
python/samples/02-agents/evaluation/evaluate_with_expected.py	New sample demonstrating expected outputs/tool call expectations.
python/samples/02-agents/evaluation/evaluate_agent.py	New sample demonstrating basic local evaluation for agents.
python/packages/core/tests/core/test_observability.py	Adjusts OTLP exporter-related test skipping.
python/packages/core/tests/core/test_local_eval.py	Adds a comprehensive test suite for local eval framework behaviors.
python/packages/core/agent_framework/_evaluation.py	Adds the provider-agnostic evaluation framework implementation.
python/packages/core/agent_framework/__init__.py	Re-exports evaluation APIs/types from the package root.
python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py	Adds the Foundry-backed `FoundryEvals` provider + trace/target helpers.
python/packages/azure-ai/agent_framework_azure_ai/__init__.py	Exposes `FoundryEvals` and helper functions from the azure-ai package.

python/packages/core/tests/core/test_observability.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py

python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py

…nto af-foundry-evals-python

…jectClient AIProjectClient from azure.ai.projects.aio requires an async credential. Switch all foundry_evals samples from azure.identity.AzureCliCredential to azure.identity.aio.AzureCliCredential. Also pass project_client to FoundryChatClient instead of duplicating endpoint+credential. Close credential in self_reflection sample to avoid resource leak. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

moonbox3 · 2026-03-25T23:01:30Z

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

+
+Typical usage::
+
+    from agent_framework import evaluate_agent


code samples should be in the sphinx format like:

Example:

    .. code-block:: python

        from typing_extensions import Never
        from agent_framework import Executor, WorkflowBuilder, WorkflowContext, handler

Also, can we move the code sample from this top-level docstring to docstrings where the code is applicable?

Copilot on behalf of alliscode: Fixed in b593126 — converted all code examples in both _foundry_evals.py and _evaluation.py to use .. code-block:: python format, and moved the module-level examples into the applicable class/function docstrings (e.g. FoundryEvals class, LocalEvaluator class, evaluate_agent, evaluate_workflow, etc.).

moonbox3 · 2026-03-26T05:41:55Z

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

+    """
+    items: list[EvalItemResult] = []
+    try:
+        output_items_page = await client.evals.runs.output_items.list(


For eval runs with more items than the page size, are we dropping/truncating them? Don't we want async pagination here?

Copilot on behalf of alliscode: Fixed in b593126 — changed for oi in output_items_page: to async for oi in output_items_page: which leverages the OpenAI SDK's AsyncCursorPage auto-pagination. This automatically follows the cursor through all pages rather than only processing the first page.
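The difference between reading one page and iterating with `async for` can be seen in a self-contained sketch. It does not use the OpenAI SDK; `fetch_page`, `AutoPagingList`, and the in-memory `ITEMS` store are hypothetical stand-ins for a cursor-paginated endpoint and an auto-paginating page object.

```python
import asyncio
from typing import AsyncIterator, Optional

# Hypothetical backing store: 5 items served 2 per page.
ITEMS = [f"item-{i}" for i in range(5)]
PAGE_SIZE = 2


async def fetch_page(cursor: int) -> tuple[list[str], Optional[int]]:
    """Return one page plus the cursor of the next page (None when done)."""
    await asyncio.sleep(0)  # stands in for a network round trip
    page = ITEMS[cursor:cursor + PAGE_SIZE]
    next_cursor = cursor + PAGE_SIZE if cursor + PAGE_SIZE < len(ITEMS) else None
    return page, next_cursor


class AutoPagingList:
    """Async-iterable that follows cursors across pages transparently."""

    def __aiter__(self) -> AsyncIterator[str]:
        return self._iterate()

    async def _iterate(self) -> AsyncIterator[str]:
        cursor: Optional[int] = 0
        while cursor is not None:
            page, cursor = await fetch_page(cursor)
            for item in page:
                yield item


async def collect_all() -> list[str]:
    # `async for` drives __aiter__, so every page is visited.
    return [item async for item in AutoPagingList()]


async def first_page_only() -> list[str]:
    # Reading a single page silently drops everything after it.
    page, _ = await fetch_page(0)
    return page
```

In this sketch `collect_all()` sees all five items while `first_page_only()` sees two, which is the truncation the review comment was asking about.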

moonbox3 · 2026-03-26T05:43:06Z

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

+        client = project_client.get_openai_client()
+        if client is None:  # pyright: ignore[reportUnnecessaryComparison]
+            raise ValueError("project_client.get_openai_client() returned None. Check project configuration.")
+        if not hasattr(client, "__aenter__"):


We should probably use isinstance(client, AsyncOpenAI) here because I don't think the underlying openai sdk exposes __aenter__.

Copilot on behalf of alliscode: Fixed in b593126 — replaced hasattr(client, '__aenter__') with isinstance(client, AsyncOpenAI). Also moved the AsyncOpenAI import from TYPE_CHECKING to a runtime import since isinstance needs it at runtime. Updated all tests to use MagicMock(spec=AsyncOpenAI) for proper type matching.
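Why the tests had to move to MagicMock(spec=...) once the check became an isinstance test can be shown with the standard library alone. `AsyncClient`, `SyncClient`, and `require_async_client` below are hypothetical stand-ins rather than the SDK or framework classes, and raising TypeError is an assumption.

```python
from unittest.mock import MagicMock


class AsyncClient:  # hypothetical stand-in for the real async client class
    async def close(self) -> None:
        ...


class SyncClient:  # hypothetical stand-in for a synchronous client class
    def close(self) -> None:
        ...


def require_async_client(client: object) -> AsyncClient:
    """Reject anything that is not an AsyncClient instance."""
    if not isinstance(client, AsyncClient):
        raise TypeError(f"expected AsyncClient, got {type(client).__name__}")
    return client


# A spec'd mock reports the spec class as its __class__, so isinstance passes.
spec_mock = MagicMock(spec=AsyncClient)

# A plain mock answers to any attribute name but is not an AsyncClient.
plain_mock = MagicMock()
```

A type check is stricter than duck typing, so mocks that previously passed by exposing the right attribute now need a `spec` to be accepted.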

moonbox3 · 2026-03-26T05:43:50Z

python/packages/core/agent_framework/_evaluation.py

+
+Typical usage — cloud evaluator::
+
+    from agent_framework import evaluate_agent, EvalResults


Same code example comment as before - probably need to update it in other places, too?

Copilot on behalf of alliscode: Fixed in b593126 — applied the same .. code-block:: python format to all 13 code example blocks across _evaluation.py (module docstring, ConversationSplitter, EvalResults, keyword_check, tool_called_check, tool_calls_present, tool_call_args_match, evaluator, LocalEvaluator, evaluate_agent, evaluate_workflow).

moonbox3 · 2026-03-26T05:45:42Z

python/packages/core/agent_framework/_evaluation.py

+        responses_list = [response] if isinstance(response, AgentResponse) else list(response)
+        queries_norm = [str(q) for q in _normalize_queries(query, len(responses_list))]
+
+    return await evaluate_agent(


This passes queries=None to evaluate_agent when query isn't provided, which hits a ValueError("Provide 'queries' alongside 'responses'..."). The docstring implies query is optional, but omitting it always errors?

Copilot on behalf of alliscode: Fixed in 3dbedc3 — evaluate_response now auto-extracts queries from the response messages when query is not provided, instead of passing None to evaluate_agent. It collects user-role messages from each response to use as the query text.
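The fallback described in this reply, together with the single-query replication fixed earlier in _normalize_queries, can be sketched as follows. The message shape (plain dicts with "role" and "text"), the function names, and the ValueError on a length mismatch are assumptions for illustration; the real code works on the framework's own message types.

```python
from typing import Optional, Sequence, Union

Message = dict[str, str]  # hypothetical shape: {"role": ..., "text": ...}


def extract_query(messages: Sequence[Message]) -> str:
    """Join the text of all user-role messages into one query string."""
    return "\n".join(m["text"] for m in messages if m["role"] == "user")


def resolve_queries(
    query: Optional[Union[str, Sequence[str]]],
    conversations: Sequence[Sequence[Message]],
) -> list[str]:
    """Produce exactly one query per conversation."""
    if query is None:
        # No query supplied: recover it from each conversation's user messages.
        return [extract_query(c) for c in conversations]
    if isinstance(query, str):
        # A single query is replicated to match the number of conversations.
        return [query] * len(conversations)
    queries = list(query)
    if len(queries) != len(conversations):
        raise ValueError(
            f"got {len(queries)} queries for {len(conversations)} responses"
        )
    return queries
```

Handling the `None` case up front is what lets the query argument be truly optional, as the docstring under review implied.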

moonbox3 · 2026-03-26T05:46:37Z

python/packages/core/agent_framework/_evaluation.py

+            executor_id = event.executor_id
+
+            # Skip internal framework executors
+            if executor_id.startswith("_") or executor_id.lower() in {"input-conversation", "end-conversation", "end"}:


We're not logging anything or throwing an error here if an executor ID starts with _ - is that intended?

Copilot on behalf of alliscode: This is intentional — executors with _-prefixed IDs are internal framework executors (e.g. workflow infrastructure) that shouldn't be evaluated as user-facing agents. Added a logger.debug() call in 3dbedc3 so it's visible when debug logging is enabled.
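The skip rule from the diff line above, with the debug log added, can be isolated in a few lines. This is a sketch rather than the framework code: the reserved IDs are copied from the diff, while `is_internal_executor` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("eval_sketch")  # hypothetical logger name

# Reserved IDs taken from the diff line under review.
INTERNAL_EXECUTOR_IDS = frozenset({"input-conversation", "end-conversation", "end"})


def is_internal_executor(executor_id: str) -> bool:
    """True for framework-internal executors that should not be evaluated."""
    internal = (
        executor_id.startswith("_")
        or executor_id.lower() in INTERNAL_EXECUTOR_IDS  # exact match, not substring
    )
    if internal:
        logger.debug("Skipping internal executor %r", executor_id)
    return internal


# The substring pitfall fixed earlier in the PR: "end" occurs inside "backend".
substring_match = any(name in "backend" for name in INTERNAL_EXECUTOR_IDS)
exact_match = is_internal_executor("backend")
```

Here `substring_match` is True while `exact_match` is False, so a user executor named "backend" is evaluated as intended; the skip itself stays silent unless debug logging is enabled.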

- Convert all Example:: / Typical usage:: code blocks to .. code-block:: python format matching codebase convention (both _evaluation.py and _foundry_evals.py)
- Add async pagination in _fetch_output_items via async for (handles large result sets)
- Replace hasattr(__aenter__) with isinstance(client, AsyncOpenAI) in _resolve_openai_client
- Move AsyncOpenAI import from TYPE_CHECKING to runtime (needed for isinstance)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix tests: use MagicMock(spec=AsyncOpenAI) for project_client mocks (isinstance check now requires proper type, not duck-typing)
- Fix tests: replace mock_page.__iter__ with _AsyncPage helper for async for
- Fix evaluate_response: auto-extract queries from response messages when query is not provided (previously always raised ValueError)
- Add debug logging when skipping internal _-prefixed executor IDs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

markwallace-microsoft added the documentation and python labels Mar 17, 2026

github-actions bot changed the title ~~Foundry Evals integration for Python~~ Python: Foundry Evals integration for Python Mar 17, 2026

alliscode force-pushed the af-foundry-evals-python branch from a0edd5f to fe9e621 Compare March 17, 2026 21:21

eavanvalkenburg reviewed Mar 18, 2026

alliscode force-pushed the af-foundry-evals-python branch 6 times, most recently from 15d8640 to aad92ac Compare March 19, 2026 20:41

alliscode force-pushed the af-foundry-evals-python branch 5 times, most recently from 901ea59 to d52c85e Compare March 24, 2026 20:03

alliscode and others added 13 commits March 25, 2026 08:06

fix: add nosec B101 for bandit assert check

c36a27b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix ruff ISC004 lint error and apply formatter

3984e44

- Wrap implicit string concatenation in parens in evaluate_multiturn_sample.py
- Apply ruff formatter to 6 other files with minor formatting drift

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove core type changes (extracted to fix/workflow-stale-session bra…

603bc69

…nch) Reverts changes to _agents.py, _agent_executor.py, and _workflow.py back to upstream/main. These fixes are now in a separate PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

alliscode and others added 8 commits March 25, 2026 08:07

Fix error message: evaluate_responses() → evaluate_traces(response_id…

299936f

…s=...) The referenced function doesn't exist; the correct API is evaluate_traces(response_ids=...) from the azure-ai package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reduce default eval timeout from 600s to 180s (3 minutes)

f37e72b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove dead _evaluate_via_responses method from FoundryEvals

ed4c55d

The method was never called — evaluate() uses _evaluate_via_dataset, and evaluate_traces() calls _evaluate_via_responses_impl directly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert unrelated formatting changes to get-started samples

8a2c237

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

alliscode force-pushed the af-foundry-evals-python branch from dc01030 to 8a2c237 Compare March 25, 2026 15:08

alliscode force-pushed the af-foundry-evals-python branch from 36c21b1 to f439746 Compare March 25, 2026 15:38

alliscode force-pushed the af-foundry-evals-python branch from a74c9d1 to 8d8b6e8 Compare March 25, 2026 17:55

alliscode and others added 3 commits March 25, 2026 11:15

Fix lint errors in eval samples (E501, ASYNC240, formatting)

24533a7

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

alliscode marked this pull request as ready for review March 25, 2026 19:43

Copilot AI review requested due to automatic review settings March 25, 2026 19:43

Copilot started reviewing on behalf of alliscode March 25, 2026 19:46

Remove evaluate_all_patterns_sample.py (redundant with focused samples)

b050004

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI reviewed Mar 25, 2026

alliscode and others added 2 commits March 25, 2026 12:56

Merge branch 'main' of https://github.com/microsoft/agent-framework i…

86ab652

…nto af-foundry-evals-python

alliscode force-pushed the af-foundry-evals-python branch from d266ee2 to 997a379 Compare March 25, 2026 20:01

Revert test_observability.py to upstream/main (not our test)

73b3de0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

moonbox3 reviewed Mar 26, 2026

alliscode and others added 2 commits March 26, 2026 09:21


5 participants
